Unsupervised Learning of Syntactic Knowledge: Methods and Measures
Abstract
Supervised methods for ambiguity resolution learn in "sterile" environments, in the absence of syntactic noise. However, in many language engineering applications manually tagged corpora are neither available nor easy to produce. On the other hand, the "exportability" of disambiguation cues acquired from a given noise-free domain (e.g. the Wall Street Journal) to other domains is not obvious. Unsupervised methods of lexical learning have many inherent limitations of their own. First, the syntactic ambiguity phenomena occurring in real domains are much more complex than the standard V N PP patterns analyzed in the literature. Second, especially in sublanguages, syntactic noise seems to be a systematic phenomenon, because many ambiguities occur within identical phrases. In such cases there is little hope of acquiring stronger statistical evidence for the correct attachment. Class-based models may reduce this problem only to a certain degree, depending upon the richness of the sublanguage and upon the size of the application corpus. Because of these inherent difficulties, we believe that syntactic learning should be a gradual process, in which the most difficult decisions are made as late as possible, using increasingly refined levels of knowledge. In this paper we present an incremental, class-based, unsupervised method to reduce syntactic ambiguity. We show that our method achieves a considerable compression of noise, preserving only those ambiguous patterns for which shallow techniques do not allow reliable decisions.

Unsupervised vs. supervised models of syntactic learning

Several corpus-based methods for syntactic ambiguity resolution have recently been presented in the literature. In (Hindle and Rooth, 1993), hereafter H&R, lexicalized rules are derived according to the probability of noun-preposition or verb-preposition bigrams for ambiguous structures like verb-noun-preposition-noun sequences.
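The bigram-based attachment decision attributed to H&R can be illustrated with a minimal sketch. It compares a smoothed estimate of P(prep | verb) with P(prep | noun) and attaches the PP to whichever head scores higher. The function names, the add-alpha smoothing, and all counts below are assumptions made for illustration; they are not the estimation procedure or data of the original work.

```python
import math
from collections import Counter

# Hypothetical toy counts (invented for illustration only).
VERB_PREP = Counter({("send", "to"): 40})
NOUN_PREP = Counter({("letter", "to"): 2, ("access", "to"): 30})
VERB_TOTAL = Counter({"send": 100, "gain": 60})
NOUN_TOTAL = Counter({"letter": 80, "access": 40})
N_PREPS = 10  # assumed size of the preposition inventory


def attachment_score(verb, noun, prep, alpha=0.5):
    """Log2 ratio of smoothed P(prep | verb) to smoothed P(prep | noun).

    Add-alpha smoothing is an assumption of this sketch; it keeps unseen
    bigrams from producing zero probabilities.
    """
    p_verb = (VERB_PREP[(verb, prep)] + alpha) / (VERB_TOTAL[verb] + alpha * N_PREPS)
    p_noun = (NOUN_PREP[(noun, prep)] + alpha) / (NOUN_TOTAL[noun] + alpha * N_PREPS)
    return math.log2(p_verb / p_noun)


def decide(verb, noun, prep):
    """Return 'verb' if the score favours verb attachment, else 'noun'."""
    return "verb" if attachment_score(verb, noun, prep) > 0 else "noun"


if __name__ == "__main__":
    print(decide("send", "letter", "to"))
    print(decide("gain", "access", "to"))
```

Note that the object of the preposition plays no role in the score, since only the (head, preposition) bigrams are counted.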
This method has been criticised because it does not consider the PP object in the attachment decision scheme. However, collecting bigrams rather than trigrams reduces the well-known problem of data sparseness. In subsequent studies, trigrams rather than bigrams were collected from corpora to derive disambiguation cues. In (Collins and Brooks, 1995) the problem of data sparseness is approached with a supervised back-off model, with interesting results. In (Resnik and Hearst, 1993) class-based trigrams are obtained by generalizing the PP head, using WordNet synonymy sets. In (Ratnaparkhi et al, 1994) word classes are derived automatically with a clustering procedure. (Franz, 1995) uses a loglinear model to estimate preferred attachments according to the linguistic features of co-occurring words (e.g. bigrams, the accompanying noun determiner, etc.). (Brill and Resnik, 1994) use transformation-based error-driven learning (Brill, 1992) to derive disambiguation rules based on simple context information (e.g. right and left adjacent words or POSs). All these approaches need extensive collections of positive examples (i.e. hand-corrected attachment instances) in order to trigger the acquisition process. Probabilistic, backed-off or loglinear models rely entirely on noise-free data, that is, correct parse trees or bracketed structures. In general the training set is the parsed Wall Street Journal (Marcus et al, 1993), with few exceptions, and the size of the training samples is around 10-20,000 test cases. Some methods do not require manually validated PP attachments, but word
Similar resources
Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports
Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compare to available knowledge-based semantic classes...
An Unsupervised Learning Method for an Attacker Agent in Robot Soccer Competitions Based on the Kohonen Neural Network
The RoboCup competition, as a great test-bed, has become a popular domain worldwide in recent years. The main objective of such competitions is to deal with the complex behavior of systems which consist of multiple autonomous agents. The rich experience of human soccer players can be used as a valuable reference for a robot soccer player. However, because of the differences between real and simulated soc...
Unsupervised Learning of Morphology by using Syntactic Categories
This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [4][12] on learning of morphology and syntax has shown that both kinds of knowledge affect each other making it possible to use one type of knowledge to help the other. In this work, we make use of syntactic information i.e. Part-of-Speech (PoS) tags of words t...
Publication date: 1996